Abstract



Vapnik-Chervonenkis (VC) dimension is a fundamental measure of the generalization capacity
of learning algorithms. However, apart from a few special cases, it is hard or impossible to
calculate analytically. Vapnik et al. [10] proposed a technique for estimating the VC dimension
empirically. While their approach behaves well in simulations, it could not be used to bound the
generalization risk of classifiers, because there were no bounds for the estimation error of the
VC dimension itself. We rectify this omission, providing high probability concentration results
for the proposed estimator and deriving corresponding generalization bounds.

1 Introduction 

Statistical learning theory is fundamentally concerned with picking, out of some class of plausible
or convenient models, ones whose predictions will be nearly optimal. Statistical optimality is most
often demonstrated by controlling the risk, or generalization error, of predictive models, i.e., their
expected inaccuracy on new data from the same source as that used to fit the model. The
paradigmatic case confronts the learner with a labeled set of training examples
$Z = \{(y_1, x_1), \ldots, (y_n, x_n)\}$ drawn independently from a distribution $\mu$ over
$\mathcal{Y} \times \mathcal{X}$. For concreteness, we take the standard task of pattern recognition
with vector features, setting $\mathcal{Y} = \{0, 1\}$ and $\mathcal{X} = \mathbb{R}^p$. Our
contribution is to control the risk of pattern recognition when using analytically intractable models.

Consider a class $\mathcal{F}$ of possible predictors, that is, a collection of functions from
$\mathcal{X}$ to $\mathcal{Y}$. From this class, the learner uses the training set to choose some
$\hat f \in \mathcal{F}$, hoping to make as few errors in the future as possible when facing similar
data. This amounts to controlling the risk of $\hat f$,

$$R_n(f) = \mathbb{E}_\mu\left[I(Y \neq f(X))\right], \tag{1}$$

where $I(A)$ is the indicator of the event $A$. Since the distribution $\mu$ is unknown, the risk
cannot be calculated explicitly, so learners often proxy it by the empirical risk of $f$,

$$\widehat{R}_n(f, Z) = \frac{1}{n}\sum_{i=1}^{n} I(Y_i \neq f(X_i)), \tag{2}$$
which we will abbreviate $\widehat{R}_n(f)$ when possible. Since (2) approximates (1), we can choose
a good predictor $\hat f$ by solving

$$\hat f = \operatorname*{argmin}_{f \in \mathcal{F}} \frac{1}{n}\sum_{i=1}^{n} I(Y_i \neq f(X_i)).$$

This process is empirical risk minimization, or ERM. ERM itself is quite general, and with
appropriate loss functions includes ordinary least squares regression, maximum likelihood,
nonparametric density estimation, and M-estimation.
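As a concrete illustration of (2) and of ERM, the following minimal sketch computes the 0-1
empirical risk and minimizes it over a small finite class of one-dimensional threshold classifiers.
The class, the helper names (`empirical_risk`, `threshold`, `erm`), and the toy data are assumptions
made for illustration only; they do not come from the text.

```python
from typing import Callable, Sequence, Tuple

Classifier = Callable[[float], int]
Sample = Sequence[Tuple[int, float]]  # pairs (y, x)

def empirical_risk(f: Classifier, data: Sample) -> float:
    """Fraction of pairs (y, x) with f(x) != y, i.e. the 0-1 empirical risk (2)."""
    return sum(1 for y, x in data if f(x) != y) / len(data)

def threshold(c: float) -> Classifier:
    """Hypothetical classifier: predict 1 when x >= c, else 0."""
    return lambda x: 1 if x >= c else 0

def erm(candidates: Sequence[float], data: Sample) -> float:
    """Return the cutoff whose threshold classifier has the smallest empirical risk."""
    return min(candidates, key=lambda c: empirical_risk(threshold(c), data))

# Toy training set; the point (1, 0.35) cannot be fit together with (0, 0.4).
Z = [(0, 0.1), (0, 0.2), (1, 0.35), (0, 0.4), (1, 0.6), (1, 0.7), (1, 0.9)]
cutoffs = [0.0, 0.25, 0.5, 0.75, 1.0]
c_hat = erm(cutoffs, Z)
print(c_hat, empirical_risk(threshold(c_hat), Z))
```

With a finite candidate list the minimization is a direct search; for richer classes the argmin
generally has to be approximated by a training algorithm.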

The next step in the statistical learning paradigm is to evaluate the performance of ERM. Is
$\hat f$ consistent (in risk)? What is the rate of convergence? Can we control the generalization
error of the chosen $\hat f$? In fact, all of these questions have been answered. Vapnik and
Chervonenkis [9] gave necessary and sufficient conditions for uniform convergence of
$\widehat{R}_n(f)$ to $R_n(f)$ in terms of the VC entropy. However, the VC entropy itself depends on
the unknown distribution $\mu$. To get around this, we look instead at a bound for the VC entropy
which is uniform over probability measures: the growth function, which can be calculated from the
VC dimension, which is based on the shattering coefficient.

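Definition 1.1. Let $\mathbb{U}$ be some (infinite) set and let $S$ be a finite subset of
$\mathbb{U}$. Let $\mathcal{C}$ be a family of subsets of $\mathbb{U}$. We say that $\mathcal{C}$
shatters $S$ if for every $S' \subseteq S$, $\exists C \in \mathcal{C}$ such that $S' = S \cap C$.

Definition 1.2 (VC dimension). The Vapnik-Chervonenkis (VC) dimension of $\mathcal{C}$ is

$$\mathrm{VCD}(\mathcal{C}) := \sup\{\operatorname{card} S : S \text{ is shattered by } \mathcal{C}\}.$$

Application of VC dimension to classes of functions is reasonably straightforward for pattern
recognition. To $f \in \mathcal{F}$, associate the set $C_f = \{u \in \mathbb{U} : f(u) = 1\}$, and
associate to $\mathcal{F}$ the class $\mathcal{C}_{\mathcal{F}} := \{C_f : f \in \mathcal{F}\}$. Then
define $\mathrm{VCD}(\mathcal{F}) := \mathrm{VCD}(\mathcal{C}_{\mathcal{F}})$.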
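The two definitions can be checked by brute force when both the class and the ground set are small
and finite. The sketch below is only an illustration under that assumption: restricting to a finite
universe gives the VC dimension of the restricted class, which is a lower bound on the VC dimension
over the full set. The function names and the two example classes (thresholds and closed intervals
on a grid) are hypothetical.

```python
from itertools import combinations
from typing import Callable, Sequence

Classifier = Callable[[float], int]

def shatters(classifiers: Sequence[Classifier], points: Sequence[float]) -> bool:
    """True if every 0/1 labeling of `points` is produced by some classifier."""
    labelings = {tuple(f(u) for u in points) for f in classifiers}
    return len(labelings) == 2 ** len(points)

def vcd_on_universe(classifiers: Sequence[Classifier], universe: Sequence[float]) -> int:
    """Size of the largest subset of the finite `universe` that is shattered."""
    best = 0
    for size in range(1, len(universe) + 1):
        if any(shatters(classifiers, S) for S in combinations(universe, size)):
            best = size
        else:
            break  # subsets of shattered sets are shattered, so larger sizes fail too
    return best

universe = [1.0, 2.0, 3.0, 4.0]
cuts = [0.5, 1.5, 2.5, 3.5, 4.5]
thresholds = [lambda x, c=c: int(x >= c) for c in cuts]
intervals = [lambda x, a=a, b=b: int(a <= x <= b) for a in cuts for b in cuts]
print(vcd_on_universe(thresholds, universe), vcd_on_universe(intervals, universe))
```

The search is exponential in the size of the universe, which mirrors the combinatorial difficulty
of computing VC dimension for richer classes.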

VC dimension is just one of many ways to measure the richness or complexity of a class of
functions. Others include covering numbers, pseudo-dimension [3], fat-shattering dimension, and
Rademacher complexity [2]. Heuristically, larger complexity leads to smaller minimum risk but
higher estimation variance, and thus it is important to balance the complexity of the function class
with the amount of data available. For VC dimension, Vapnik [8] shows that a sufficient condition
for uniform risk consistency is that

$$\lim_{n \to \infty} \frac{\log GF(h^*, n)}{n} = 0,$$

where $GF(h, n)$ is the growth function, which satisfies $\log GF(h, n) \le h(\log(n/h) + 1)$, and
$h^* = \mathrm{VCD}(\mathcal{F})$ is the VC dimension of the function class. Furthermore, Vapnik
[7, 8] proves a concentration result of the empirical risk around the true risk: for any $\rho > 0$,

$$\mathbb{P}\Big(\sup_{f \in \mathcal{F}} \big|R_n(f) - \widehat{R}_n(f)\big| > \rho\Big) \le 4\,GF(h^*, 2n)\exp\{-n\rho^2\}. \tag{3}$$

Similar bounds exist for other loss functions such as margin loss, loss functions constrained to a
compact interval, or extended real-valued loss functions for regression problems.
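To get a feel for the scale of (3), the sketch below evaluates its right side with $GF(h, 2n)$
replaced by the upper bound $\exp\{h(\log(2n/h) + 1)\}$ quoted above. The function names and the
example values of $h$, $n$, and $\rho$ are illustrative assumptions; the computation is done on the
log scale and the result is capped at 1, since a probability bound above 1 carries no information.

```python
import math

def log_growth_bound(h: float, n: float) -> float:
    """Upper bound h * (log(n / h) + 1) on log GF(h, n); intended for n >= h > 0."""
    return h * (math.log(n / h) + 1.0)

def risk_deviation_bound(h: float, n: int, rho: float) -> float:
    """Right side of (3) with GF(h, 2n) replaced by its upper bound, capped at 1."""
    log_bound = math.log(4.0) + log_growth_bound(h, 2 * n) - n * rho ** 2
    return 1.0 if log_bound >= 0 else math.exp(log_bound)

for n in (1_000, 20_000):
    print(n, risk_deviation_bound(h=10, n=n, rho=0.1))
```

For fixed $\rho$ the bound is vacuous until $n\rho^2$ outgrows $h(\log(2n/h) + 1)$, after which it
decays exponentially in $n$.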

Given a function class $\mathcal{F}$, knowing $h^* = \mathrm{VCD}(\mathcal{F})$ is crucial to using
these sorts of results. However, for many interesting function classes (support vector machines,
multi-layer neural networks, random forests, etc.) this knowledge is entirely unavailable. The
combinatorial nature of VC dimension makes it very difficult to find in interesting cases. As a
remedy, Vapnik et al. [10] propose a way to estimate the VC dimension by simulation. While the
authors showed its accuracy by estimating the VC dimension of linear classifiers (known to be the
number of covariates with an extra degree of freedom for the intercept), estimated VC dimension
cannot be simply plugged in to finite-sample concentration results (such as (3)), because the
estimates themselves fluctuate around the true values. Since VC dimension is only useful to the
extent it lets us bound generalization risk, this presents a problem. In this paper, we rectify this
situation.

We prove two main results. First, we show that, using the procedure of [10], the estimated VC
dimension, $\hat h$, will concentrate around the truth, $h^*$, with high probability:

Theorem 1.3. Let $\delta > \max\{24c_1, 29\}$ and suppose that $h^* < M$. Then

$$\mathbb{P}\big(|\hat h - h^*| > \delta\big) \le 13\exp\left\{-\frac{mkc_2\delta^2}{16c_3}\right\},$$

where $c_1$, $c_2$, and $c_3$ are constants given in the proof and in Table 1, and $k$ and $m$ are
integers freely chosen as part of the simulation procedure.

Second, we show that if we use the estimated VC dimension, we can still recover bounds like 
that in (3): 

Theorem 1.4. Choose $\delta$ as in Theorem 1.3. Let $\rho > 0$. Set

$$\varphi = 13\exp\left\{-\frac{mkc_2\delta^2}{16c_3}\right\}.$$

Then, for any classifier $f \in \mathcal{F}$ where $\mathcal{F}$ has estimated VC dimension
$\hat h$, we have

$$\mathbb{P}\Big(\sup_{f \in \mathcal{F}} \big|R_n(f) - \widehat{R}_n(f)\big| > \rho\Big) \le 4\,GF(\hat h + \delta, 2n)\exp\{-n\rho^2\}(1 - \varphi) + \varphi. \tag{4}$$

The first term on the right of (4) is the same as the original bound in (3), except that the true
VC dimension is replaced with its estimate $\hat h$ plus a small fudge factor $\delta$. The second
term depends on the confidence that we have in our estimate, through $\varphi$. The estimation
procedure allows us to estimate $h^*$ arbitrarily well, given infinite computational time, through
the choice of $m$ and $k$. Of course this is infeasible in practice, but Theorem 1.4 allows the user
to trade computational time for statistical accuracy.
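The trade-off between simulation effort and the confidence term can be made concrete with a short
numerical sketch of the right side of (4). Everything numeric here is an assumption for
illustration: the text says $c_2$ has to be computed numerically, so the value passed below is a
placeholder, $GF$ is replaced by its upper bound $\exp\{h(\log(n/h) + 1)\}$, and the condition on
$\delta$ from Theorem 1.3 is not checked by the code.

```python
import math

C3 = 2304.0  # value of c_3 stated in the text

def confidence_term(delta: float, m: int, k: int, c2: float, c3: float = C3) -> float:
    """13 * exp(-m k c2 delta^2 / (16 c3)), capped at 1."""
    return min(1.0, 13.0 * math.exp(-m * k * c2 * delta ** 2 / (16.0 * c3)))

def bound_with_estimated_vcd(h_hat: float, delta: float, n: int, rho: float,
                             m: int, k: int, c2: float) -> float:
    """Right side of (4), using the growth-function upper bound and capping at 1."""
    h = h_hat + delta
    log_first = math.log(4.0) + h * (math.log(2 * n / h) + 1.0) - n * rho ** 2
    first = 1.0 if log_first >= 0 else math.exp(log_first)
    p = confidence_term(delta, m, k, c2)
    return first * (1.0 - p) + p

# Placeholder c2 = 0.01; more replications m and design points k shrink the second term.
for m, k in [(10, 5), (1000, 50)]:
    print(m, k, bound_with_estimated_vcd(10, 30, 10**6, 0.1, m, k, c2=0.01))
```

With few simulations the confidence term dominates and the bound is uninformative; increasing
$mk$ drives it toward the first term alone.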

The remainder of this paper provides details for the proofs of our two main theorems. Section 2
summarizes the estimation procedure developed in Vapnik et al. [10]. Section 3 proves both
theorems, drawing on empirical process theory. Because there is a lot of notation, we summarize it
in Table 1. Finally, Section 4 concludes and provides some ideas for future work.



2 Estimation 

Vapnik et al. [10] show that the expected maximum deviation between the empirical risks of
a classifier on two datasets can be bounded by a function which depends only on the VC
dimension of the classifier. In other words, given a collection of classifiers $\mathcal{F}$, and two
data sets $W = \{(y_1, x_1), \ldots, (y_n, x_n)\}$ and $W' = \{(y'_1, x'_1), \ldots, (y'_n, x'_n)\}$,
we have the bound

$$\xi(n) = \mathbb{E}\Big[\sup_{f \in \mathcal{F}} \big(\widehat{R}_n(f, W) - \widehat{R}_n(f, W')\big)\Big] \le
\begin{cases}
1 & n/h^* < \tfrac{1}{2}, \\[4pt]
C_1 \dfrac{\log(2n/h^*) + 1}{n/h^*} & \text{if } n/h^* \text{ is small}, \\[8pt]
C_2 \sqrt{\dfrac{\log(2n/h^*) + 1}{n/h^*}} & \text{if } n/h^* \text{ is large}.
\end{cases} \tag{5}$$
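Table 1: Constants and important notation

Notation                 Meaning
$h^*$                    the VC dimension of the function class $\mathcal{F}$
$\hat h$                 the least squares estimate of the VC dimension (Section 3)
$M$                      we assume $h^* < M$
$\Phi_h(n)$              $1$ if $n < h/2$; else $a\,\frac{\log(2n/h)+1}{n/h - a''}\Big(\sqrt{1 + \frac{a'(n/h - a'')}{\log(2n/h)+1}} + 1\Big)$
$a$                      0.16
$a'$                     1.2
$a''$                    0.14927
$\varphi$                $13\exp\{-mkc_2\delta^2/(16c_3)\}$
$GF(h, n)$               the growth function, with $\log GF(h, n) \le h(\log(n/h) + 1)$
$c(n, M)$, $L(n)$        Lipschitz-like constants such that $\forall n$: $c(n, M)|h - h'| \le |\Phi_h(n) - \Phi_{h'}(n)| \le L(n)|h - h'|$
$N(\eta, \mathcal{G})$   the $\eta$-covering number of $\mathcal{G}$
$H(\eta, \mathcal{G})$   the $\eta$-entropy of $\mathcal{G}$
$k$, $m$                 integers chosen for the simulation in Algorithm 1
$c_1$                    constant with $J(\tau) \le 2c_1\tau$; closed form in footnote 2
$c_2$                    $\frac{1}{k}\sum_{\ell=1}^{k} c^2(n_\ell, M)$
$c_3$                    2304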

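We can bound (5) by $\Phi_h(n)$, viewed as a function of $n$ and parametrized by $h$:

$$\Phi_h(n) =
\begin{cases}
1 & n < h/2, \\[4pt]
a\,\dfrac{\log(2n/h) + 1}{n/h - a''}\left(\sqrt{1 + \dfrac{a'(n/h - a'')}{\log(2n/h) + 1}} + 1\right) & \text{else.}
\end{cases} \tag{6}$$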
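The function in (6) is straightforward to evaluate. The sketch below is a minimal implementation
using the numerical constants $a = 0.16$, $a' = 1.2$, $a'' = 0.14927$ attributed to [10] in this
section; the function name `phi_h` and the example arguments are assumptions for illustration. It
also makes it easy to check numerically that the two branches nearly agree at $n/h = 1/2$.

```python
import math

A, A_PRIME, A_DOUBLE_PRIME = 0.16, 1.2, 0.14927  # a, a', a'' as quoted in the text

def phi_h(n: float, h: float) -> float:
    """Evaluate (6): 1 when n < h/2, otherwise the curve in tau = n / h (h > 0)."""
    tau = n / h
    if tau < 0.5:
        return 1.0
    ratio = (math.log(2.0 * tau) + 1.0) / (tau - A_DOUBLE_PRIME)
    # A_PRIME / ratio equals a' (n/h - a'') / (log(2n/h) + 1), the term under the root.
    return A * ratio * (math.sqrt(1.0 + A_PRIME / ratio) + 1.0)

for n in (4, 5, 20, 100):
    print(n, phi_h(n, h=10.0))
```

For fixed $h$ the curve decreases in $n$, so a larger VC dimension shifts the whole curve upward;
this monotonicity is what makes $h$ identifiable from observations at several design points.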



Here the constants $a = 0.16$, $a' = 1.2$ were determined numerically in [10] to adjust the
trade-off between "small" and "large" in (5), and $a'' = 0.14927$ was chosen so that
$\Phi(0.5) = 1$ (this choice depends only on $a$ and $a'$). Furthermore, the bound is tight. Since
(6) is known up to $h$, we can estimate it given knowledge of the maximum deviation on the left
side of (5). Of course, we do not have such knowledge, but we can generate observations

$$\hat\xi(n) = \Phi_{h^*}(n) + \epsilon(n)$$

at design points $n$. Here $\epsilon$ is mean zero noise (since the bound is tight) having an
unknown distribution with support on $[0, 1]$. Given enough such observations at different design
points $n_\ell$, we can then estimate the true VC dimension $h^*$ using nonlinear least squares. Of
course, generating $\hat\xi(n_\ell)$ is nontrivial. Vapnik et al. [10] give an algorithm for
generating the appropriate observations. Essentially, at each (fixed) design point $n_\ell$,
$\ell \in \{1, \ldots, k\}$, we simulate $m$ data points $(\hat\xi_i(n_\ell), \Phi_h(n_\ell))$, for
$i = 1, \ldots, m$, so as to approximate $\xi(n_\ell)$ as defined in (5). This procedure is shown in
Algorithm 1. Vapnik et al. [10] show that this algorithm works well in practice, recovering the
known VC dimension of linear classifiers ($p + 1$ for $p$ explanatory variables and an intercept)
and demonstrating that the method for generating the dataset does not affect the algorithm's
performance.$^1$ In the next section, we prove our main result, showing that in fact, the estimate
concentrates around the truth with high probability.

Algorithm 1 Generate $\hat\xi(n_\ell)$

Given a collection of possible classifiers $\mathcal{F}$ and a grid of design points
$n_1, \ldots, n_k$, generate $\hat\xi(n_\ell)$. Repeat the procedure at each design point,
$n_\ell$, $m$ times.

1: Generate a data set from the same sample space $\mathcal{Y} \times \mathcal{X}$ as the training
   sample that is independent of the training sample. The generated set should be of size
   $2n_\ell$: $\{(y_1, x_1), \ldots, (y_{2n_\ell}, x_{2n_\ell})\}$.
2: Split the data set into two equal sets, $W$ and $W'$.
3: Flip the labels ($y$ values) of $W'$.
4: Merge the two sets and train the classifier simultaneously on the entire set: $W$ with the
   "correct" labels and $W'$ with the "wrong" labels.
5: Calculate the training error of the estimated classifier $\hat f$ on $W$ with the "correct"
   labels and on $W'$ using the "correct" labels.
6: Set $\hat\xi_i(n_\ell) = \big|\widehat{R}_{n_\ell}(\hat f, W) - \widehat{R}_{n_\ell}(\hat f, W')\big|$.
7: Set $\hat\xi(n_\ell) = \frac{1}{m}\sum_{i=1}^{m} \hat\xi_i(n_\ell)$.
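The steps of Algorithm 1 can be sketched in a few lines once a training routine is available. The
code below is an illustration under explicit assumptions, not the implementation used in [10]: the
classifier family is one-dimensional decision stumps trained by exact ERM (`stump_erm`), features
are drawn uniformly on $[0, 1]$, labels are fair coin flips, and the design points and $m$ are
arbitrary. All function names are hypothetical.

```python
import random
from typing import Callable, List, Tuple

Point = Tuple[int, float]  # (label y, feature x)

def risk(f: Callable[[float], int], data: List[Point]) -> float:
    """0-1 empirical risk of f on data."""
    return sum(f(x) != y for y, x in data) / len(data)

def stump_erm(data: List[Point]) -> Callable[[float], int]:
    """Exact ERM over stumps x -> 1[x >= c] and their complements x -> 1[x < c]."""
    xs = sorted({x for _, x in data})
    best_err, best_f = None, None
    for c in xs + [xs[-1] + 1.0]:          # cut at each point, plus one above the max
        for flip in (0, 1):                # flip = 1 gives the complementary stump
            f = lambda x, c=c, flip=flip: int(x >= c) ^ flip
            err = sum(f(x) != y for y, x in data)
            if best_err is None or err < best_err:
                best_err, best_f = err, f
    return best_f

def xi_hat(n: int, m: int, rng: random.Random) -> float:
    """Average of m replications of steps 1-6 at design point n (step 7)."""
    total = 0.0
    for _ in range(m):
        sample = [(rng.randint(0, 1), rng.random()) for _ in range(2 * n)]  # step 1
        W, W_prime = sample[:n], sample[n:]                                 # step 2
        flipped = [(1 - y, x) for y, x in W_prime]                          # step 3
        f_hat = stump_erm(W + flipped)                                      # step 4
        total += abs(risk(f_hat, W) - risk(f_hat, W_prime))                 # steps 5-6
    return total / m

rng = random.Random(0)
design = [5, 10, 20, 40]
observations = [xi_hat(n, m=20, rng=rng) for n in design]
print(list(zip(design, observations)))
```

Training on the merged set with flipped labels pushes the fitted classifier to do well on $W$ and
badly on $W'$, which is how the procedure approximates the supremum in (5).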

3 Proof of results 

We now prove Theorem 1.3 and Theorem 1.4. The proofs draw heavily on the empirical process 
techniques of van de Geer [5] and van de Geer [6]; however, those works ignored constants, and 
made stronger assumptions than necessary for the case at hand. We strive to make our results as 
self-contained as possible, appealing to [6] only for the proof of Corollary 3.5. 

Our goal is to show that the estimated VC dimension $\hat h$ is close to the true dimension $h^*$.
This will mean showing that $\Phi_{\hat h}$ is close to $\Phi_{h^*}$ when averaged over the design
points $n_\ell$. It will be convenient to introduce a norm and inner product for functions $g$:

$$\|g\|_k^2 = \frac{1}{k}\sum_{\ell=1}^{k} g^2(n_\ell), \qquad
(\epsilon, g)_k = \frac{1}{k}\sum_{\ell=1}^{k} \epsilon(n_\ell)\,g(n_\ell).$$

So we take as our estimate of $h^*$

$$\hat h = \operatorname*{argmin}_{h \in [0, M]} \big\|\hat\xi - \Phi_h\big\|_k^2,$$

and our immediate goal is control over $\|\Phi_{\hat h} - \Phi_{h^*}\|_k$.

$^1$ There are of course ways to generate data so that this procedure will fail, e.g., generating
the data with too-regular determinism, or with dependence. We refer the cautious reader to Vapnik
et al. [10]. We also return to this point at the end of §3.
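Because the parameter $h$ is a scalar confined to a bounded interval, the least squares estimate
can be approximated by a plain grid search. The sketch below is one such approximation, offered as
an illustration rather than the fitting method of [10]: the grid step, the function names, and the
noiseless test observations are assumptions, $h = 0$ is excluded to avoid dividing by zero, and the
curve from (6) is repeated so that the block is self-contained.

```python
import math
from typing import Sequence

A, A_PRIME, A_DOUBLE_PRIME = 0.16, 1.2, 0.14927  # constants quoted in Section 2

def phi_h(n: float, h: float) -> float:
    """The curve (6) as a function of n for a given h > 0."""
    tau = n / h
    if tau < 0.5:
        return 1.0
    ratio = (math.log(2.0 * tau) + 1.0) / (tau - A_DOUBLE_PRIME)
    return A * ratio * (math.sqrt(1.0 + A_PRIME / ratio) + 1.0)

def fit_vc_dimension(design: Sequence[float], xi_obs: Sequence[float],
                     M: float, step: float = 0.01) -> float:
    """Grid minimizer of (1/k) * sum_l (xi_obs[l] - phi_h(n_l))^2 over h in (0, M]."""
    def loss(h: float) -> float:
        return sum((xi - phi_h(n, h)) ** 2 for n, xi in zip(design, xi_obs)) / len(design)
    grid = [step * j for j in range(1, int(round(M / step)) + 1)]
    return min(grid, key=loss)

# Sanity check on noiseless observations generated from a known h.
design = [10, 20, 50, 100, 200]
noiseless = [phi_h(n, 7.0) for n in design]
h_hat = fit_vc_dimension(design, noiseless, M=50.0)
print(h_hat)
```

In practice the observations would come from Algorithm 1, and the accuracy of the fit is limited by
the simulation noise rather than by the grid resolution.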

For every $f \in \mathcal{F}$ and every dataset $Z$, $\widehat{R}_n(f, W)$ is bounded between 0 and
1. Therefore, the residuals $\epsilon(n_\ell)$ are also in $[0, 1]$. In fact, we can show that they
are subgaussian.



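Lemma 3.1. At all design points $n$,

$$\mathbb{E}[\exp\{t\epsilon(n)\}] \le \exp\{t^2/8m\}. \tag{7}$$

Proof. By a standard Hoeffding type argument, we have that

$$\mathbb{E}[\exp\{t\epsilon_i(n)\}] \le \exp\{t^2/8\}.$$

Therefore

$$\mathbb{E}[\exp\{t\epsilon(n)\}] = \mathbb{E}\left[\prod_{i=1}^{m} \exp\{t\epsilon_i(n)/m\}\right]
\le \exp\left\{\frac{mt^2}{8m^2}\right\} = \exp\{t^2/8m\}. \qquad \square$$

The next step is to show that we can control weighted averages of the $\epsilon(n_\ell)$.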



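Lemma 3.2. Suppose $\epsilon_\ell := \epsilon(n_\ell)$ are random variables satisfying (7). Then for
any $\gamma \in \mathbb{R}^k$ and $\rho > 0$,

$$\mathbb{P}\left(\Big|\sum_{\ell=1}^{k} \epsilon_\ell\gamma_\ell\Big| > \rho\right)
\le 2\exp\left\{-\frac{2m\rho^2}{\sum_{\ell=1}^{k}\gamma_\ell^2}\right\}.$$

Proof. Using a Chernoff bound, we have, for $t > 0$,

$$\mathbb{P}\left(\sum_{\ell=1}^{k} \epsilon_\ell\gamma_\ell > \rho\right)
\le \exp\left\{-t\rho + \frac{t^2}{8m}\sum_{\ell=1}^{k}\gamma_\ell^2\right\}.$$

Taking

$$t = \frac{4m\rho}{\sum_{\ell=1}^{k}\gamma_\ell^2}$$

minimizes the right hand side. The same argument applies for $-\sum_{\ell=1}^{k}
\epsilon_\ell\gamma_\ell$, so a union bound gives the result. □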
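The choice of $t$ in the proof of Lemma 3.2 comes from minimizing a quadratic exponent. The short
derivation below spells out that step; the shorthand $S$ for the sum of squared weights is
introduced here only for readability.

```latex
% Shorthand (not in the original): S = \sum_{\ell=1}^{k} \gamma_\ell^2.
\begin{align*}
\frac{d}{dt}\left(-t\rho + \frac{t^2 S}{8m}\right)
  &= -\rho + \frac{tS}{4m} = 0
  \quad\Longrightarrow\quad t^\ast = \frac{4m\rho}{S}, \\
-t^\ast\rho + \frac{(t^\ast)^2 S}{8m}
  &= -\frac{4m\rho^2}{S} + \frac{2m\rho^2}{S}
  = -\frac{2m\rho^2}{S}.
\end{align*}
```

Substituting $t^\ast$ gives the exponent $-2m\rho^2 / \sum_\ell \gamma_\ell^2$ in the lemma, and the
factor 2 in front comes from the union bound over the two tails.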
In order to state our result about $\|\Phi_{\hat h} - \Phi_{h^*}\|_k$, we must specify the
complexity of the function class $\mathcal{G} := \{\Phi_h : 0 \le h \le M\}$, which we will measure
with its entropy.

Definition 3.3. The functions $g_1, \ldots, g_N$ are an $\eta$-cover of $\mathcal{G}$ if every
$g \in \mathcal{G}$ is within $\eta$ of some $g_j$, $\|g - g_j\|_k \le \eta$. The $\eta$-covering
number $N(\eta, \mathcal{G})$ is the cardinality of the smallest $\eta$-cover (or $\infty$ if there
isn't one). The $\eta$-entropy is the log of the covering number,
$H(\eta, \mathcal{G}) = \log N(\eta, \mathcal{G})$.

While it may seem excessive to use covering numbers and entropy to deal with a function
class parametrized by a scalar, doing so lets us get much tighter bounds than would otherwise be
possible. The key to our argument will be the entropy of the restricted class
$\mathcal{G}(\tau) := \{\Phi_h \in \mathcal{G} : \|\Phi_h - \Phi_{h^*}\|_k \le \tau\}$.

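Lemma 3.4.

$$H(\eta, \mathcal{G}(\tau)) \le \log\left(\frac{4\tau/c' + \eta}{\eta}\right),$$

where $c'$ is defined below.

Proof. $\Phi_h$ is bounded and differentiable in $h$ and therefore Lipschitz with constants $L(n)$.
Thus

$$\|\Phi_h - \Phi_{h'}\|_k \le \sqrt{\frac{1}{k}\sum_{\ell=1}^{k} L^2(n_\ell)}\;|h - h'|.$$

Set $c' = \sqrt{\frac{1}{k}\sum_{\ell=1}^{k} L^2(n_\ell)}$. Covering a $\tau$ ball around
$\Phi_{h^*}$ in the $\|\cdot\|_k$ metric is then equivalent to covering a $\tau/c'$ ball around
$h^*$ in the Euclidean metric. It is well known (cf. [6]) that

$$H(\eta, B(\tau/c')) \le \log\left(\frac{4\tau/c' + \eta}{\eta}\right). \qquad \square$$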



The remaining proofs rely on the peeling device. Intuitively, the idea is that considering the
entropy of larger and larger balls centered around $\Phi_{h^*}$ will allow us to "peel" off sets of
increasingly smaller probability. This peeling argument is critical to our proof that
$\|\Phi_{\hat h} - \Phi_{h^*}\|_k$ is small with high probability.

To use peeling here, define $d(h) := \|\Phi_h - \Phi_{h^*}\|_k$ and consider a strictly increasing
sequence $\nu_s$, starting with $\nu_0 = 0$ but growing to $\infty$. We can peel $\mathcal{G}$ into
$\mathcal{G} = \bigcup_{s=1}^{\infty} \mathcal{G}_s$, where

$$\mathcal{G}_s = \{\Phi_h \in \mathcal{G} : \nu_{s-1} \le d(h) < \nu_s\}.$$

Then we have that for any $\rho > 0$, and our residuals $\epsilon$ (which implicitly depend on the
choice of $g := \Phi_h$),

$$\mathbb{P}\left(\sup_{g \in \mathcal{G}} \frac{|\epsilon|}{d(h)} > \rho\right)
\le \sum_{s=1}^{\infty} \mathbb{P}\left(\sup_{g \in \mathcal{G}_s} \frac{|\epsilon|}{d(h)} > \rho\right)
\le \sum_{s=1}^{\infty} \mathbb{P}\left(\sup_{g \in \mathcal{G}_s} |\epsilon| > \rho\,\nu_{s-1}\right).$$

This lets us get probability inequalities for the weighted process from probabilities for the
original process. We will want to allow the weights $\gamma_\ell$ in Lemma 3.2 to depend on
functions $\Phi_h$, and in particular investigate the behavior of the worst-case $h$. Taking
$\nu_s = 2^s$ for $s > 0$ and $\nu_0 = 0$ will allow us to derive an important corollary to
Lemma 3.2 as well as Theorem 3.6.
Choosing $\nu_s$ this way means that it is not enough to control the covering number of the entire
function class $\mathcal{G}$, but rather we must cover a sequence of restricted classes
$\mathcal{G}(\tau)$ with smaller and smaller balls. Therefore, we will need the entropy sum,

$$J(\tau) := \sum_{s=1}^{\infty} 2^{-s}\tau\sqrt{H(2^{-s}\tau, \mathcal{G}(\tau))},$$

which is bounded by the entropy integral,

$$J(\tau) \le 2\int_0^{\tau} du\,\sqrt{H(u, \mathcal{G}(\tau))}$$

(see [6, p. 29]). Lemma 3.4 implies that$^2$

$$J(\tau) \le 2\tau\int_0^1 dv\,\sqrt{\log(1 + 4v/c')} \le 2c_1\tau.$$

Finally, we can prove an important corollary to Lemma 3.2. The proof makes use of the entropy
integral as well as the peeling device, and it follows from Lemma 3.2 in van de Geer [6], so we
provide only the necessary adjustments in our proof here. However, we will need both the peeling
device and the entropy integral again in the proof of Theorem 3.6.

Corollary 3.5 (Corollary of Lemma 3.2). If $\sup_{g \in \mathcal{G}} \|g\|_k \le \tau$ and (7) holds
for all design points $n_\ell$, then for all

$$\delta > \frac{\tau}{\sqrt{2km}}\max\{24c_1, 29\},$$

we have, as a consequence of Lemma 3.2,

$$\mathbb{P}\left(\sup_{g \in \mathcal{G}} \Big|\frac{1}{k}\sum_{\ell=1}^{k} \epsilon_\ell\,g(n_\ell)\Big| > \delta\right)
\le 4\exp\left\{-\frac{km\delta^2}{c_3\tau^2}\right\},$$

where $c_1$ is as above and $c_3 = 2304$.

Proof. The proof is given in Lemma 3.2 in van de Geer [6]. In our case, the entropy integral
converges, so we may take $K = \infty$ in that proof. Furthermore, we replace equation (3.3) there
with the result of Lemma 3.2 here, and set $\epsilon = \delta/2$. □

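Theorem 3.6. Suppose that $\hat h$ is the least squares estimate of $h^*$ defined above. Let

$$\delta > \frac{4}{\sqrt{2mk}}\max\{24c_1, 29\}. \tag{8}$$

Then,

$$\mathbb{P}\big(\|\Phi_{\hat h} - \Phi_{h^*}\|_k > \delta\big) \le 13\exp\left\{-\frac{mk\delta^2}{16c_3}\right\}.$$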



2 In fact, ci = (c + l/4)- v /log(4c / + 1) - ^erfi(V4c' + 1), where erfi is the imaginary error function. Despite the 
adjective "imaginary", ci is always real. 
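Proof. First, note that $\|\Phi_{\hat h} - \Phi_{h^*}\|_k^2 \le 2(\epsilon, \Phi_{\hat h} -
\Phi_{h^*})_k$. Then we use peeling and the lemmas above:

$$\mathbb{P}\big(\|\Phi_{\hat h} - \Phi_{h^*}\|_k > \delta\big)
\le \sum_{s=0}^{\infty} \mathbb{P}\left(\sup_{\Phi_h \in \mathcal{G}(2^{s+1}\delta)} (\epsilon, \Phi_h - \Phi_{h^*})_k > 2^{2s-1}\delta^2\right)
=: \sum_{s=0}^{\infty} P_s.$$

Now $\forall s \ge 0$,

$$2^{2s-1}\delta^2 > \frac{2^{s+1}\delta}{\sqrt{2km}}\max\{24c_1, 29\}$$

by (8), therefore, we can apply Corollary 3.5 to each $P_s$ with $\tau = 2^{s+1}\delta$. This gives

$$\begin{aligned}
\sum_{s=0}^{\infty} P_s
&\le \sum_{s=0}^{\infty} 4\exp\left\{-\frac{mk\,2^{4s-2}\delta^4}{c_3\,2^{2s+2}\delta^2}\right\}
 = \sum_{s=0}^{\infty} 4\exp\left\{-\frac{mk\delta^2\,2^{2s}}{16c_3}\right\} \\
&= 4\exp\left\{-\frac{mk\delta^2}{16c_3}\right\} + 4\exp\left\{-\frac{mk\delta^2}{4c_3}\right\}
 + \sum_{s=2}^{\infty} 4\exp\left\{-\frac{mk\delta^2\,2^{2s}}{16c_3}\right\} \\
&\le 4\exp\left\{-\frac{mk\delta^2}{16c_3}\right\} + 4\exp\left\{-\frac{mk\delta^2}{4c_3}\right\}
 + \sum_{s=2}^{\infty} 4\exp\left\{-\frac{mk\delta^2 (s-1)}{c_3}\right\} \\
&= 4\exp\left\{-\frac{mk\delta^2}{16c_3}\right\} + 4\exp\left\{-\frac{mk\delta^2}{4c_3}\right\}
 + 4\left(1 - \exp\left\{-\frac{mk\delta^2}{c_3}\right\}\right)^{-1}\exp\left\{-\frac{mk\delta^2}{c_3}\right\},
\end{aligned}$$

where the third line uses $2^{2s}/16 = 4^{s-2} \ge s - 1$ for $s \ge 2$. Then, by condition (8), we
have that

$$4\left(1 - \exp\left\{-\frac{mk\delta^2}{c_3}\right\}\right)^{-1} < 5$$

and the first exponential is the largest so we have the result. □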
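The first step of the proof of Theorem 3.6 is stated without justification. A short derivation is
sketched below, assuming the observation model $\hat\xi = \Phi_{h^*} + \epsilon$ from Section 2 and
$h^* \in [0, M]$, so that $h^*$ is a feasible point of the least squares problem defining $\hat h$.

```latex
% Assumes \hat\xi = \Phi_{h^*} + \epsilon and that \hat h minimizes \|\hat\xi - \Phi_h\|_k^2 over [0, M].
\begin{align*}
\|\hat\xi - \Phi_{\hat h}\|_k^2 &\le \|\hat\xi - \Phi_{h^*}\|_k^2 = \|\epsilon\|_k^2, \\
\|\hat\xi - \Phi_{\hat h}\|_k^2
  &= \|\epsilon - (\Phi_{\hat h} - \Phi_{h^*})\|_k^2
   = \|\epsilon\|_k^2 - 2(\epsilon, \Phi_{\hat h} - \Phi_{h^*})_k
     + \|\Phi_{\hat h} - \Phi_{h^*}\|_k^2, \\
\Longrightarrow\quad
\|\Phi_{\hat h} - \Phi_{h^*}\|_k^2 &\le 2(\epsilon, \Phi_{\hat h} - \Phi_{h^*})_k .
\end{align*}
```

This is why, on the event $2^{s}\delta < \|\Phi_{\hat h} - \Phi_{h^*}\|_k \le 2^{s+1}\delta$, the
inner product must exceed $2^{2s-1}\delta^2$, which is the threshold used in the peeling sum.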

Finally, we can use the Lipschitz behavior of the function combined with the bound h* < M 
to derive our main result. 

Proof of Theorem 1.3. The function $\Phi_h(n)$ is well behaved. In particular, we have that for
some $c(n, M)$,

$$c(n, M)|h - h'| \le |\Phi_h(n) - \Phi_{h'}(n)|$$

for all $h, h' \le M$ and every $n$. This is easily verified, though it is necessary to calculate
$c(n, M)$ numerically. Therefore,

$$|\hat h - h^*| \le \left(\frac{1}{k}\sum_{\ell=1}^{k} c^2(n_\ell, M)\right)^{-1/2} \|\Phi_{\hat h} - \Phi_{h^*}\|_k.$$

So setting $c_2 = \frac{1}{k}\sum_{\ell=1}^{k} c^2(n_\ell, M)$ and applying Theorem 3.6 gives the
result. □
Proof of Theorem 1.4. Define $A = \{\sup_{f \in \mathcal{F}} |R_n(f) - \widehat{R}_n(f)| > \rho\}$
and $B = \{h^* \le \hat h + \delta\}$, then we are interested in controlling $\mathbb{P}(A)$. By the
law of total probability, we have

$$\begin{aligned}
\mathbb{P}(A) &= \mathbb{P}(A \mid B)\mathbb{P}(B) + \mathbb{P}(A \mid B^c)\mathbb{P}(B^c) \\
&\le \mathbb{P}(A \mid B)\mathbb{P}(B) + \mathbb{P}(B^c) \\
&\le 4\,GF(\hat h + \delta, 2n)\exp\{-n\rho^2\}(1 - \varphi) + \varphi. \qquad \square
\end{aligned}$$

4 Discussion 

In this paper, we showed how to derive generalization error bounds from the estimated rather than
actual VC dimension of a function class $\mathcal{F}$. Our method uses the simulation procedure
proposed by Vapnik et al. [10] for the estimates. Empirical process theory for nonparametric least
squares regression shows that these estimates $\hat h$ concentrate around the truth $h^*$ with high
probability. The resulting bounds can be used for model selection as well as to characterize the
finite-sample predictive ability of the model $\hat f$ chosen through empirical risk minimization.

The algorithm outlined here is not the only way to estimate VC dimension. Shao et al. [4]
modify Algorithm 1 in light of ideas from experimental design, varying the number of replications
$m$ with the design point $n_\ell$, and show that this improves the estimates of the VC dimension.
Modifying our empirical process techniques to use this improved estimator would be desirable, but
the extension is nontrivial.

As mentioned in the introduction, there are many other methods for measuring the richness of 
a model class. Rademacher complexity in expectation is difficult or impossible to calculate, but it 
has an obvious empirical counterpart for which concentration results already exist thereby allowing 
for tight data-based generalization error bounds. However, Rademacher complexity cannot be used 
with unbounded loss functions. VC dimension, while discussed here in the context of classification, 
generalizes to regression problems with unbounded loss as long as appropriate moment conditions 
are satisfied. Hence, our technique will apply in these settings as well. Indeed, since VC dimension 
is a property of the class of prediction functions and not the data-generating process, and finite 
VC dimension has recently [1] been shown to characterize learning from ergodic sources, it may be 
possible to use our procedure as part of an algorithm for bounding prediction risk on dependent 
data. 